@litianningdatadog
Contributor

https://datadoghq.atlassian.net/browse/SVLS-8080

Overview

Merge Lambda Managed Instance feature branch

Testing

Covered by individual commits

duncanista and others added 28 commits December 1, 2025 11:06
`INVOKE` event subscription in elevator would crash
* add `ec2-capacity-provider` init type

needed for elevator mode

* improve debug log for telemetry error while serializing
* update `ReportMetrics` to be an enum to allow `Elevator` metrics

allows us to have a diff to check in the other components

* set metrics for lifecycle given the type of report

* only send log for `OnDemand` metrics

* send correct enhanced metrics given the type of report

* add doc comments

* fmt
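
The `ReportMetrics` change described in the bullets above can be sketched roughly as follows. This is only an illustration under stated assumptions: the payload structs, their fields, and the metric names are hypothetical, and only the `ReportMetrics`, `OnDemand`, and `Elevator` names come from the commit message.

```rust
/// Hypothetical payload for an on-demand report; the real fields are not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
pub struct OnDemandReport {
    pub duration_ms: f64,
    pub billed_duration_ms: u64,
}

/// Hypothetical payload for an elevator-mode report.
#[derive(Debug, Clone, PartialEq)]
pub struct ElevatorReport {
    pub duration_ms: f64,
}

/// One enum instead of one struct, so downstream code can branch on the report type.
#[derive(Debug, Clone, PartialEq)]
pub enum ReportMetrics {
    OnDemand(OnDemandReport),
    Elevator(ElevatorReport),
}

impl ReportMetrics {
    /// Duration is assumed to be available for both report types.
    pub fn duration_ms(&self) -> f64 {
        match self {
            ReportMetrics::OnDemand(report) => report.duration_ms,
            ReportMetrics::Elevator(report) => report.duration_ms,
        }
    }

    /// Mirrors "only send log for `OnDemand` metrics".
    pub fn should_send_report_log(&self) -> bool {
        matches!(self, ReportMetrics::OnDemand(_))
    }

    /// Picks which enhanced metrics to emit for this report type (names are made up).
    pub fn enhanced_metric_names(&self) -> Vec<&'static str> {
        match self {
            ReportMetrics::OnDemand(_) => vec!["duration", "billed_duration"],
            ReportMetrics::Elevator(_) => vec!["duration"],
        }
    }
}
```

Because every `match` must be exhaustive, adding a variant forces each consumer to decide how to handle it, which may be what "allows us to have a diff to check in the other components" refers to.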
route was changed, as opposed to schema version
…ce) mode support with stats generation

https://datadoghq.atlassian.net/browse/SVLS-7584

Implement comprehensive LMI mode support for concurrent Lambda invocations:

* Add background periodic flusher for continuous data collection in LMI mode
* Implement PlatformReport event handling with proper stats generation
* Add LMI mode REPORT log formatting with status, duration, and error details
* Integrate StatsGenerator and StatsConcentratorService throughout event pipeline
* Add missing stats_generator field to SendingTraceProcessor for both PlatformReport and PlatformRuntimeDone events

Architecture improvements:

* Remove InvocationProcessorService wrapper, use Arc<TokioMutex> directly
* Simplify event handling by passing stats_concentrator to all event handlers
* Add #[must_use] attribute to Listener::new() for better API safety

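
As a rough sketch of the two architecture items about shared state and `#[must_use]`: the example below shares a processor through an `Arc`-wrapped mutex with no service wrapper and marks the constructor `#[must_use]`. It makes several assumptions: `Processor`, its `reports_seen` field, and `on_platform_report` are hypothetical, and `std::sync::Mutex` stands in for the Tokio mutex named in the commit so the snippet builds with the standard library alone.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the invocation processor state.
#[derive(Debug, Default)]
pub struct Processor {
    pub reports_seen: u64,
}

/// Hypothetical listener that holds the shared processor directly.
pub struct Listener {
    processor: Arc<Mutex<Processor>>,
}

impl Listener {
    /// With `#[must_use]`, the compiler warns if a caller builds a listener and drops it unused.
    #[must_use]
    pub fn new(processor: Arc<Mutex<Processor>>) -> Self {
        Self { processor }
    }

    /// Locks the shared state for the duration of one update.
    pub fn on_platform_report(&self) {
        let mut guard = self.processor.lock().expect("processor lock poisoned");
        guard.reports_seen += 1;
    }
}
```

In async code the lock would typically be awaited rather than taken synchronously; that difference is left out of this sketch.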
https://datadoghq.atlassian.net/browse/SVLS-7836?atlOrigin=eyJpIjoiMWNmZTMzOGE4NGEwNDE4MTk5Njk0N2ZmMmU3MzExMjgiLCJwIjoiaiJ9

The extension neither creates SnapStart spans nor emits SnapStart
metrics. This PR adds both.

When a Lambda function with SnapStart enabled is invoked for the first
time, we get `Platform.RestoreStart` and `Platform.RestoreReport`. These
effectively take the place of the `Platform.InitStart` and
`Platform.InitReport` events, so the code flow is nearly identical
to how we handle the cold start span and duration metric.

Note: when a SnapStart instance is restored, we also receive the
`Platform.InitStart` and `Platform.InitReport` events in addition to the
`Platform.RestoreStart` and `Platform.RestoreReport` events. However,
these `Init` events do not come from the sandbox starting for that
invoke; they are generated when the snapshot is created.
This is misleading. For example, this
[trace](https://ddserverless.datadoghq.com/serverless/aws/lambda?fromUser=false&graphType=flamegraph&group=&highlight=snapstart-java-cdk-function&panel_end=1761860524106&panel_paused=false&panel_start=1761846124106&shouldShowLegend=true&sp=%5B%7B%22p%22%3A%7B%22entityId%22%3A%22aws-lambda-functions%2Bsnapstart-java-cdk-function%2Bus-east-1%2B425362996713%22%7D%2C%22i%22%3A%22lambda-panel%22%7D%2C%7B%22p%22%3A%7B%22traceID%22%3A%225400520227836710313%22%2C%22selectedSpanID%22%3A%22644948261311059067%22%7D%2C%22i%22%3A%22trace-panel%22%7D%5D&spanID=644948261311059067&text_search=snapstart&traceID=5400520227836710313&traceQuery=&start=1761845683104&end=1761860083104&paused=false)
is more than 3 hours long. The lambda was invoked more than 3 hours
after the snapshot version was created. (This is the current
experience).

I deployed my own extension with the changes and confirmed we are now
getting a restore span and not an init span,
[link](https://ddserverless.datadoghq.com/serverless/aws/lambda?fromUser=false&graphType=flamegraph&group=&panel_end=1761860640000&panel_paused=false&panel_start=1761846240000&shouldShowLegend=true&sp=%5B%7B%22p%22%3A%7B%22entityId%22%3A%22aws-lambda-functions%2Bsnapstart-java-function%2Bus-east-1%2B425362996713%22%7D%2C%22i%22%3A%22lambda-panel%22%7D%2C%7B%22p%22%3A%7B%22traceID%22%3A%226634828896084800457%22%2C%22selectedSpanID%22%3A%222017721198037440020%22%7D%2C%22i%22%3A%22trace-panel%22%7D%5D&spanID=2017721198037440020&text_search=snapstart&traceID=6634828896084800457&traceQuery=&start=1761845683104&end=1761860083104&paused=false).
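
The span selection described above can be illustrated with a small sketch. It assumes a simplified model in which the events seen for an invocation are collected into a slice; the `PlatformEvent` and `StartupSpan` types and the `startup_span` function are hypothetical and are not the extension's actual code.

```rust
/// Simplified stand-ins for the platform events named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PlatformEvent {
    InitStart,
    InitReport,
    RestoreStart,
    RestoreReport,
}

/// Which startup span to create for the invocation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StartupSpan {
    ColdStart,
    SnapStartRestore,
}

/// Prefers the restore span whenever restore events are present, because the
/// `Init` events seen on a restored instance date from snapshot creation.
pub fn startup_span(events: &[PlatformEvent]) -> Option<StartupSpan> {
    let restored = events.iter().any(|event| {
        matches!(
            event,
            PlatformEvent::RestoreStart | PlatformEvent::RestoreReport
        )
    });
    let initialized = events
        .iter()
        .any(|event| matches!(event, PlatformEvent::InitStart | PlatformEvent::InitReport));

    if restored {
        Some(StartupSpan::SnapStartRestore)
    } else if initialized {
        Some(StartupSpan::ColdStart)
    } else {
        None
    }
}
```

Under this model, an invocation that reports both `Init` and `Restore` events yields a restore span only, which avoids the multi-hour init span shown in the first trace.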
Switch to new value of AWS_LAMBDA_INIT_TYPE
Minor fix to ensure successful local testing.
…LS-7879] (#44)

* ship logs between invocations without request_id

* fmt

* test

* Minor change to prepare for code merge
…el [SVLS-7906] (#47)

* emit fd/threads metrics at shutdown

* pause monitoring on no active invocations

* fmt
* create empty context on init start to be updated on platform start/invoke

* clippy
@litianningdatadog litianningdatadog requested a review from a team as a code owner December 1, 2025 16:34
@litianningdatadog
Contributor Author

/merge

@dd-devflow-routing-codex

dd-devflow-routing-codex bot commented Dec 1, 2025

2025-12-01 17:05:22 UTC ℹ️ Start processing command /merge


2025-12-01 17:05:26 UTC ℹ️ MergeQueue: pull request added to the queue

The expected merge time in main is approximately 0s (p90).


2025-12-01 17:06:43 UTC ℹ️ MergeQueue: This merge request was merged

@dd-mergequeue dd-mergequeue bot merged commit 37f3c29 into main Dec 1, 2025
38 of 39 checks passed
@dd-mergequeue dd-mergequeue bot deleted the elevator branch December 1, 2025 17:06